Finding community structure in very large networks

The discovery and analysis of community structure in networks is a topic of considerable recent
interest within the physics community, but most methods proposed so far are unsuitable for very
large networks because of their computational cost. Here we present a hierarchical agglomeration
algorithm for detecting community structure which is faster than many competing algorithms: its
running time on a network with n vertices and m edges is $O(md \log n)$, where d is the depth of
the dendrogram describing the community structure. Many real-world networks are sparse and
hierarchical, with $m \sim n$ and $d \sim \log n$, in which case our algorithm runs in essentially linear time,
$O(n \log^2 n)$. As an example of the application of this algorithm we use it to analyze a network of
items for sale on the web-site of a large online retailer, items in the network being linked if they
are frequently purchased by the same buyer. The network has more than 400 000 vertices and 2
million edges. We show that our algorithm can extract meaningful communities from this network,
revealing large-scale patterns present in the purchasing habits of customers.

I. INTRODUCTION

Many systems of current interest to the scientific community
can usefully be represented as networks [1, 2, 3, 4].
Examples include the Internet [5] and the world-wide
web [6, 7], social networks [8], citation networks [9, 10],
food webs [11], and biochemical networks [12, 13]. Each
of these networks consists of a set of nodes or vertices 
representing, for instance, computers or routers on the 
Internet or people in a social network, connected together 
by links or edges, representing data connections between 
computers, friendships between people, and so forth. 

One network feature that has been emphasized in 
recent work is community structure, the gathering of 
vertices into groups such that there is a higher density
of edges within groups than between them [14].
The problem of detecting such communities within networks
has been well studied. Early approaches such
as the Kernighan-Lin algorithm [15], spectral partitioning
[16, 17], or hierarchical clustering [18] work well for
specific types of problems (particularly graph bisection or
problems with well defined vertex similarity measures),
but perform poorly in more general cases [19].

To combat this problem a number of new algorithms 
have been proposed in recent years. Girvan and Newman
[20, 21] proposed a divisive algorithm that uses
edge betweenness as a metric to identify the bound- 
aries of communities. This algorithm has been ap- 
plied successfully to a variety of networks, including 
networks of email messages, human and animal so- 
cial networks, networks of collaborations between scien- 
tists and musicians, metabolic networks and gene net- 
works [U mil |2J, HI |H IHIH However, 
as noted in [21], the algorithm makes heavy demands on
computational resources, running in $O(m^2 n)$ time on an
arbitrary network with m edges and n vertices, or $O(n^3)$
time on a sparse graph (one in which $m \sim n$, which covers
most real-world networks of interest). This restricts the
algorithm's use to networks of at most a few thousand
vertices with current hardware.

More recently a number of faster algorithms have been
proposed [31, 32, 33]. In [32], one of us proposed an algorithm
based on the greedy optimization of the quantity
known as modularity [21]. This method appears to work
well both in contrived test cases and in real-world situations,
and is substantially faster than the algorithm of
Girvan and Newman. A naive implementation runs in
time $O((m + n)n)$, or $O(n^2)$ on a sparse graph.

Here we propose a new algorithm that performs the
same greedy optimization as the algorithm of [32] and
therefore gives identical results for the communities
found. However, by exploiting some shortcuts in the optimization
problem and using more sophisticated data
structures, it runs far more quickly, in time $O(md \log n)$
where d is the depth of the "dendrogram" describing the
network's community structure. Many real-world networks
are sparse, so that $m \sim n$; and moreover, for networks
that have a hierarchical structure with communities
at many scales, $d \sim \log n$. For such networks our algorithm
has essentially linear running time, $O(n \log^2 n)$.

This is not merely a technical advance but has sub- 
stantial practical implications, bringing within reach the 
analysis of extremely large networks. Networks of ten 
million vertices or more should be possible in reasonable 
run times. As an example, we give results from the ap- 
plication of the algorithm to a recommender network of 
books from the online bookseller Amazon.com, which has 
more than 400 000 vertices and two million edges. 



II. THE ALGORITHM 

Modularity [21] is a property of a network and a specific
proposed division of that network into communities.
It measures how good the division is, in the sense
that there are many edges within communities and only
a few between them. Let $A_{vw}$ be an element of the adjacency
matrix of the network thus:

$$A_{vw} = \begin{cases} 1 & \text{if vertices } v \text{ and } w \text{ are connected,} \\ 0 & \text{otherwise,} \end{cases} \tag{1}$$

and suppose the vertices are divided into communities
such that vertex v belongs to community $c_v$. Then the
fraction of edges that fall within communities, i.e., that
connect vertices that both lie in the same community, is

$$\frac{\sum_{vw} A_{vw}\,\delta(c_v, c_w)}{\sum_{vw} A_{vw}} = \frac{1}{2m} \sum_{vw} A_{vw}\,\delta(c_v, c_w), \tag{2}$$

where the δ-function $\delta(i,j)$ is 1 if i = j and 0 otherwise,
and $m = \frac{1}{2} \sum_{vw} A_{vw}$ is the number of edges in the graph.
This quantity will be large for good divisions of the net- 
work, in the sense of having many within-community 
edges, but it is not, on its own, a good measure of com- 
munity structure since it takes its largest value of 1 in 
the trivial case where all vertices belong to a single com- 
munity. However, if we subtract from it the expected 
value of the same quantity in the case of a randomized 
network, we do get a useful measure. 

The degree $k_v$ of a vertex v is defined to be the number
of edges incident upon it:

$$k_v = \sum_w A_{vw}. \tag{3}$$

The probability of an edge existing between vertices v
and w if connections are made at random but respecting
vertex degrees is $k_v k_w / 2m$. We define the modularity Q
to be

$$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w). \tag{4}$$



If the fraction of within-community edges is no different 
from what we would expect for the randomized network, 
then this quantity will be zero. Nonzero values represent 
deviations from randomness, and in practice it is found 
that a value above about 0.3 is a good indicator of sig- 
nificant community structure in a network. 
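
As a concrete illustration of the modularity defined above, the following minimal Python sketch computes Q for a given division directly from an edge list. The function name `modularity`, the edge-list representation, and the toy graph (two triangles joined by a single edge) are illustrative assumptions and do not come from the original text.

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = (1/2m) * sum_vw [A_vw - k_v*k_w/(2m)] * delta(c_v, c_w).

    edges: list of undirected (v, w) pairs without duplicates or self-loops.
    community: dict mapping every vertex to a community label.
    """
    m = len(edges)
    degree = defaultdict(int)
    within = 0  # number of edges with both ends in the same community
    for v, w in edges:
        degree[v] += 1
        degree[w] += 1
        if community[v] == community[w]:
            within += 1
    # Total degree per community gives the expected within-community fraction.
    degree_sum = defaultdict(int)
    for v, k in degree.items():
        degree_sum[community[v]] += k
    expected = sum((s / (2 * m)) ** 2 for s in degree_sum.values())
    return within / m - expected

# Toy graph (hypothetical): triangles {0,1,2} and {3,4,5} joined by edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q_split = modularity(edges, split)
q_single = modularity(edges, {v: "A" for v in split})
```

For this toy graph the two-triangle division gives Q = 6/7 - 1/2 = 5/14, roughly 0.357, while placing all vertices in a single community gives Q = 0, in line with the remark that the raw within-community fraction alone is not a useful measure.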

If high values of the modularity correspond to good di- 
visions of a network into communities, then one should be 
able to find such good divisions by searching through the 
possible candidates for ones with high modularity. While 
finding the global maximum modularity over all possible 
divisions seems hard in general, reasonably good solutions
can be found with approximate optimization techniques.
The algorithm proposed in [32] uses a greedy
optimization in which, starting with each vertex being
the sole member of a community of one, we repeatedly
join together the two communities whose amalgamation
produces the largest increase in Q. For a network of n
vertices, after n − 1 such joins we are left with a single
community and the algorithm stops. The entire process
can be represented as a tree whose leaves are the vertices
of the original network and whose internal nodes
correspond to the joins. This dendrogram represents a
hierarchical decomposition of the network into communities
at all levels.

The most straightforward implementation of this idea 
(and the only one considered in [32]) involves storing the
adjacency matrix of the graph as an array of integers 
and repeatedly merging pairs of rows and columns as 
the corresponding communities are merged. For the case 
of the sparse graphs that are of primary interest in the 
field, however, this approach wastes a good deal of time 
and memory space on the storage and merging of matrix 
elements with value 0, which is the vast majority of the 
adjacency matrix. The algorithm proposed in this paper 
achieves speed (and memory efficiency) by eliminating 
these needless operations. 

To simplify the description of our algorithm let us define
the following two quantities:

$$e_{ij} = \frac{1}{2m} \sum_{vw} A_{vw}\,\delta(c_v, i)\,\delta(c_w, j), \tag{5}$$

which is the fraction of edges that join vertices in community
i to vertices in community j, and

$$a_i = \frac{1}{2m} \sum_v k_v\,\delta(c_v, i), \tag{6}$$

which is the fraction of ends of edges that are attached
to vertices in community i. Then, writing
$\delta(c_v, c_w) = \sum_i \delta(c_v, i)\,\delta(c_w, i)$, we have, from Eq. (4),

$$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \sum_i \delta(c_v, i)\,\delta(c_w, i) = \sum_i \left( e_{ii} - a_i^2 \right). \tag{7}$$

The operation of the algorithm involves finding the
changes in Q that would result from the amalgamation of
each pair of communities, choosing the largest of them,
and performing the corresponding amalgamation. One
way to envisage (and implement) this process is to think
of the network as a multigraph, in which a whole community
is represented by a vertex, bundles of edges connect one
vertex to another, and edges internal to communities are
represented by self-edges. The adjacency matrix of this
multigraph has elements $A'_{ij} = 2m e_{ij}$, and the joining of
two communities i and j corresponds to replacing the
ith and jth rows and columns by their sum. In the algorithm
of [32] this operation is done explicitly on the entire
matrix, but if the adjacency matrix is sparse (which we
expect in the early stages of the process) the operation
can be carried out more efficiently using data structures
for sparse matrices. Unfortunately, calculating $\Delta Q_{ij}$ and
finding the pair i, j with the largest $\Delta Q_{ij}$ then becomes
time-consuming.
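
To make the bookkeeping concrete, here is a minimal sketch that computes the fractions e_ij and a_i for a given division and evaluates the change in Q that joining two communities would produce. It relies on the identity that joining i and j adds e_ij + e_ji to the within-community fraction and replaces a_i^2 + a_j^2 by (a_i + a_j)^2, so the change is 2(e_ij - a_i a_j). The function names and the toy graph are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def community_fractions(edges, community):
    """Return (e, a): e[(i, j)] is the fraction of edges joining community i
    to community j, a[i] the fraction of edge ends attached to community i."""
    m = len(edges)
    e = defaultdict(float)
    a = defaultdict(float)
    for v, w in edges:
        i, j = community[v], community[w]
        # Each undirected edge enters the sum over v, w twice (as vw and wv).
        e[(i, j)] += 1 / (2 * m)
        e[(j, i)] += 1 / (2 * m)
        a[i] += 1 / (2 * m)
        a[j] += 1 / (2 * m)
    return e, a

def q_from_fractions(e, a):
    """Modularity as the sum over communities of e_ii - a_i^2."""
    return sum(e.get((i, i), 0.0) - a[i] ** 2 for i in a)

def delta_q(e, a, i, j):
    """Change in Q if communities i and j were joined."""
    return 2 * (e.get((i, j), 0.0) - a[i] * a[j])

# Toy graph (hypothetical): two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
e, a = community_fractions(edges, split)
q = q_from_fractions(e, a)
change = delta_q(e, a, "A", "B")
```

Here Q = 5/14 for the two-triangle division, and joining the two communities would change Q by -5/14, which takes it to the value 0 of the single-community division.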

In our new algorithm, rather than maintaining the adjacency
matrix and calculating $\Delta Q_{ij}$, we instead maintain
and update a matrix of the values of $\Delta Q_{ij}$. Since joining
two communities with no edge between them can never
produce an increase in Q, we need only store $\Delta Q_{ij}$ for
those pairs i, j that are joined by one or more edges.
Since this matrix has the same support as the adjacency
matrix, it will be similarly sparse, so we can again represent
it with efficient data structures. In addition, we
make use of an efficient data structure to keep track of
the largest $\Delta Q_{ij}$. These improvements result in a considerable
saving of both memory and time.

In total, we maintain three data structures: 

1. A sparse matrix containing $\Delta Q_{ij}$ for each pair i, j
of communities with at least one edge between
them. We store each row of the matrix both as
a balanced binary tree (so that elements can be
found or inserted in $O(\log n)$ time) and as a max-heap
(so that the largest element can be found in
constant time).

2. A max-heap H containing the largest element of
each row of the matrix $\Delta Q_{ij}$ along with the labels
i, j of the corresponding pair of communities.

3. An ordinary vector array with elements $a_i$.
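
The three structures above can be approximated in a few lines of Python. The sketch below is not the layout described in the text: it swaps in hash maps (dicts) for the balanced binary trees and a single `heapq` heap with lazily discarded stale entries for the per-row max-heaps and the heap H. The class name `SparseDeltaQ` and its methods are hypothetical.

```python
import heapq

class SparseDeltaQ:
    """Sparse matrix of dQ values with a lazily cleaned global max-heap."""

    def __init__(self):
        self.rows = {}   # rows[i][j] = dQ_ij for communities joined by an edge
        self.heap = []   # entries (-dQ_ij, i, j); some may be out of date

    def set(self, i, j, value):
        """Insert or overwrite dQ_ij and record it in the heap."""
        self.rows.setdefault(i, {})[j] = value
        heapq.heappush(self.heap, (-value, i, j))

    def remove_row(self, i):
        """Delete row i and column i; old heap entries become stale."""
        for j in self.rows.pop(i, {}):
            self.rows.get(j, {}).pop(i, None)

    def max_entry(self):
        """Return (dQ_ij, i, j) for the largest stored element, or None."""
        while self.heap:
            neg, i, j = self.heap[0]
            if self.rows.get(i, {}).get(j) == -neg:
                return -neg, i, j
            heapq.heappop(self.heap)  # stale: value changed or row removed
        return None

# Small usage example with made-up values.
dq = SparseDeltaQ()
for i, j, value in [(0, 1, 0.1), (1, 2, 0.3)]:
    dq.set(i, j, value)
    dq.set(j, i, value)
first = dq.max_entry()
dq.set(1, 2, 0.05)
dq.set(2, 1, 0.05)
second = dq.max_entry()
dq.remove_row(0)
third = dq.max_entry()
```

Lazy deletion keeps the code short, but the heap can grow beyond the number of stored elements, so this variant does not carry the running-time guarantees derived for the structures in the text.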

As described above we start off with each vertex being
the sole member of a community of one, in which case
$e_{ij} = 1/2m$ if i and j are connected and zero otherwise,
and $a_i = k_i/2m$. Thus we initially set

$$\Delta Q_{ij} = \begin{cases} 1/2m - k_i k_j/(2m)^2 & \text{if } i, j \text{ are connected,} \\ 0 & \text{otherwise,} \end{cases} \tag{8}$$

and

$$a_i = \frac{k_i}{2m} \tag{9}$$

for each i. (This assumes the graph is unweighted;
weighted graphs are a simple generalization [34].)
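
The initialization can be sketched as follows. One assumption should be flagged clearly: the sketch stores 2(1/2m - k_i k_j/(2m)^2), that is, e_ij + e_ji - 2 a_i a_j with e_ij = e_ji = 1/2m. This is the actual change in Q when two connected single-vertex communities are joined, and it is twice the expression as printed in Eq. (8). The doubled value is used here so that the stored numbers sit on the same scale as the -2 a_j a_k terms of the update rules. The function name `initial_state` and the toy graph are hypothetical.

```python
from collections import defaultdict

def initial_state(edges):
    """Return (dq, a) for the division in which every vertex is alone.

    dq[i][j] holds the change in Q from joining the connected singletons
    i and j; a[i] = k_i / 2m is the fraction of edge ends at vertex i.
    """
    m = len(edges)
    degree = defaultdict(int)
    for v, w in edges:
        degree[v] += 1
        degree[w] += 1
    a = {v: k / (2 * m) for v, k in degree.items()}
    dq = {v: {} for v in degree}
    for v, w in edges:
        # Assumption (see lead-in): twice the expression printed in Eq. (8).
        value = 2 * (1 / (2 * m) - a[v] * a[w])
        dq[v][w] = value
        dq[w][v] = value
    return dq, a

# Toy graph (hypothetical): two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
dq, a = initial_state(edges)
# With every vertex alone there are no within-community edges,
# so Q is minus the sum of the squared a_i.
q_singletons = -sum(x * x for x in a.values())
```

For the toy graph, joining vertices 0 and 1 (both of degree 2, with m = 7) gives a change of 20/196, which takes Q from -34/196 to -14/196. A direct evaluation of the modularity for the division with {0, 1} merged gives the same -14/196.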
Our algorithm can now be defined as follows. 

1. Calculate the initial values of $\Delta Q_{ij}$ and $a_i$ according
to Eqs. (8) and (9), and populate the max-heap with
the largest element of each row of the matrix $\Delta Q$.

2. Select the largest $\Delta Q_{ij}$ from H, join the corresponding
communities, update the matrix $\Delta Q$, the
heap H and $a_i$ (as described below) and increment
Q by $\Delta Q_{ij}$.

3. Repeat step 2 until only one community remains. 

Our data structures allow us to carry out the updates
in step 2 quickly. First, note that we need only adjust
a few of the elements of $\Delta Q$. If we join communities i
and j, labeling the combined community j, say, we need
only update the jth row and column, and remove the
ith row and column altogether. The update rules are as
follows. If community k is connected to both i and j,
then

$$\Delta Q'_{jk} = \Delta Q_{ik} + \Delta Q_{jk}. \tag{10a}$$

If k is connected to i but not to j, then

$$\Delta Q'_{jk} = \Delta Q_{ik} - 2 a_j a_k. \tag{10b}$$

If k is connected to j but not to i, then

$$\Delta Q'_{jk} = \Delta Q_{jk} - 2 a_i a_k. \tag{10c}$$

Note that these equations imply that Q has a single peak
over the course of the algorithm, since after the largest
$\Delta Q$ becomes negative all the $\Delta Q$ can only decrease.
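
The steps and update rules above can be put together in a compact sketch. It is a deliberately simplified variant, not the implementation analyzed in the text: rows are plain dicts and the largest element is found by a full scan instead of through heaps, so it does not achieve the stated running time. The initial values are 2(1/2m - k_i k_j/(2m)^2), an assumption made so that they match the scale of the -2 a_j a_k terms in Eqs. (10b) and (10c). The function name `greedy_modularity` and the toy graph are hypothetical.

```python
from collections import defaultdict

def greedy_modularity(edges):
    """Greedy agglomeration; returns (best Q, communities at the peak)."""
    m = len(edges)
    degree = defaultdict(int)
    for v, w in edges:
        degree[v] += 1
        degree[w] += 1
    a = {v: k / (2 * m) for v, k in degree.items()}
    dq = {v: {} for v in degree}
    for v, w in edges:
        dq[v][w] = dq[w][v] = 2 * (1 / (2 * m) - a[v] * a[w])
    members = {v: {v} for v in degree}
    q = -sum(x * x for x in a.values())  # Q with every vertex alone
    best_q, best = q, [set(s) for s in members.values()]

    while len(dq) > 1:
        # A full scan stands in for the max-heap H of the text.
        candidates = [(val, i, j) for i, row in dq.items()
                      for j, val in row.items()]
        if not candidates:
            break  # the remaining communities share no edges
        val, i, j = max(candidates, key=lambda t: t[0])
        row_i = dq.pop(i)   # row i is removed, community j survives
        row_j = dq[j]
        del row_i[j]
        del row_j[i]
        for k in set(row_i) | set(row_j):
            if k in row_i and k in row_j:
                new = row_i[k] + row_j[k]          # Eq. (10a)
            elif k in row_i:
                new = row_i[k] - 2 * a[j] * a[k]   # Eq. (10b)
            else:
                new = row_j[k] - 2 * a[i] * a[k]   # Eq. (10c)
            row_j[k] = new
            dq[k][j] = new
            dq[k].pop(i, None)
        a[j] += a[i]
        del a[i]
        members[j] |= members.pop(i)
        q += val
        if q > best_q:
            best_q, best = q, [set(s) for s in members.values()]
    return best_q, best

# Toy graph (hypothetical): two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
best_q, best = greedy_modularity(edges)
```

On the toy graph the peak is reached after four joins, with Q = 5/14 and the two triangles as communities. The fifth and final join has a negative change in Q and brings Q back to 0.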

To analyze how long the algorithm takes using our
data structures, let us denote the degrees of i and j
in the reduced graph (i.e., the numbers of neighboring
communities) as |i| and |j| respectively. The first operation
in a step of the algorithm is to update the jth
row. To implement Eq. (10a), we insert the elements of
the ith row into the jth row, summing them wherever an
element exists in both columns. Since we store the rows
as balanced binary trees, each of these |i| insertions takes
$O(\log |j|) \le O(\log n)$ time. We then update the other elements
of the jth row, of which there are at most |i| + |j|,
according to Eqs. (10b) and (10c). In the kth row, we update
a single element, taking $O(\log |k|) \le O(\log n)$ time,
and there are at most |i| + |j| values of k for which we
have to do this. All of this thus takes $O((|i| + |j|) \log n)$
time.

We also have to update the max-heaps for each row and
the overall max-heap H. Reforming the max-heap corresponding
to the jth row can be done in $O(|j|)$ time [35].
Updating the max-heap for the kth row by inserting, raising,
or lowering $\Delta Q_{kj}$ takes $O(\log |k|) \le O(\log n)$ time.
Since we have changed the maximum element on at most
|i| + |j| rows, we need to do at most |i| + |j| updates
of H, each of which takes $O(\log n)$ time, for a total of
$O((|i| + |j|) \log n)$.

Finally, the update $a'_j = a_j + a_i$ (and $a_i = 0$) is trivial
and can be done in constant time.

Since each join takes $O((|i| + |j|) \log n)$ time, the total
running time is at most O(logn) times the sum over all 
nodes of the dendrogram of the degrees of the correspond- 
ing communities. Let us make the worst-case assumption 
that the degree of a community is the sum of the degrees 
of all the vertices in the original network comprising it. In 
that case, each vertex of the original network contributes 
its degree to all of the communities it is a part of, along 
the path in the dendrogram from it to the root. If the 
dendrogram has depth d, there are at most d nodes in 
this path, and since the total degree of all the vertices 
is 2m, we have a running time of $O(md \log n)$ as stated.
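
The counting argument above can be written out as a short chain of inequalities. This is a sketch under the same worst-case assumption as the text. The notation $\mathcal{D}$ for the set of dendrogram nodes, $|c|$ for the degree of community c in the reduced graph, and $T(n, m)$ for the total running time is introduced here only for illustration.

```latex
\sum_{c \in \mathcal{D}} |c|
  \;\le\; \sum_{c \in \mathcal{D}} \sum_{v \in c} k_v
  \;=\; \sum_{v} k_v \,\bigl|\{ c \in \mathcal{D} : v \in c \}\bigr|
  \;\le\; d \sum_{v} k_v
  \;=\; 2md ,
\qquad\text{so}\qquad
T(n, m) = O(\log n) \cdot 2md = O(md \log n).
```

With $m \sim n$ and $d \sim \log n$ the bound becomes $O(n \log^2 n)$, the essentially linear running time quoted for sparse, hierarchical networks.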

We note that, if the dendrogram is unbalanced, some
time savings can be gained by inserting the sparser row
into the less sparse one. In addition, we have found that
in practical situations it is usually unnecessary to maintain
the separate max-heaps for each row. These heaps
are used to find the largest element in a row quickly, but
their maintenance takes a moderate amount of effort and
this effort is wasted if the largest element in a row does
not change when two rows are amalgamated, which turns
out often to be the case. Thus we find that the following
simpler implementation works quite well in realistic situations:
if the largest element of the kth row was $\Delta Q_{ki}$
or $\Delta Q_{kj}$ and is now reduced by Eq. (10b) or (10c), we
simply scan the kth row to find the new largest element.
Although the worst-case running time of this approach
has an additional factor of n, the average-case running
time is often better than that of the more sophisticated
algorithm. It should be noted that the hierarchies generated
by these two versions of our algorithm will differ
slightly as a result of the differences in how ties are broken
for the maximum element in a row. However, we find
that in practice these differences do not cause significant
deviations in the modularity, the community size distribution,
or the composition of the largest communities.

FIG. 1: The modularity Q over the course of the algorithm
(the x axis shows the number of joins). Its maximum value is
Q = 0.745, where the partition consists of 1684 communities.
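
The simpler variant described above, which caches the largest element of each row and rescans the row only when that element has been lowered, might look like the following sketch. The function name `update_row_max` and the dict-based layout are hypothetical. Removal of the ith column during a join is left out for brevity.

```python
def update_row_max(row, row_max, k, j, new_value):
    """Set row[k][j] = new_value and keep row_max[k] = (value, column)
    pointing at the largest element of row k."""
    old_value, old_col = row_max.get(k, (float("-inf"), None))
    row[k][j] = new_value
    if new_value >= old_value:
        # The changed element is now the largest in the row.
        row_max[k] = (new_value, j)
    elif old_col == j:
        # The previous maximum was lowered: rescan the whole row.
        col = max(row[k], key=row[k].get)
        row_max[k] = (row[k][col], col)
    # Otherwise the cached maximum is unaffected and no work is needed.

# Small usage example with made-up values.
row = {0: {1: 0.2, 2: 0.5}}
row_max = {0: (0.5, 2)}
update_row_max(row, row_max, 0, 2, 0.1)   # maximum lowered, row rescanned
after_rescan = row_max[0]
update_row_max(row, row_max, 0, 3, 0.4)   # new element becomes the maximum
after_insert = row_max[0]
update_row_max(row, row_max, 0, 1, 0.05)  # non-maximal element lowered
after_noop = row_max[0]
```

Only the second branch costs time proportional to the length of the row, which is why this approach does well when the largest element of a row rarely changes.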

III. AMAZON.COM PURCHASING NETWORK 

The output of the algorithm described above is precisely
the same as that of the slower hierarchical algorithm
of [32]. The much improved speed of our algorithm
however makes possible studies of very large networks for
which previous methods were too slow to produce useful
results. Here we give one example, the analysis of a co-purchasing
or "recommender" network from the online
vendor Amazon.com. Amazon sells a variety of products,
particularly books and music, and as part of their web
sales operation they list for each item A the ten other
items most frequently purchased by buyers of A. This
information can be represented as a directed network in
which vertices represent items and there is an edge from
item A to another item B if B was frequently purchased
by buyers of A. In our study we have ignored the directed
nature of the network (as is common in community structure
calculations), assuming any link between two items,
regardless of direction, to be an indication of their similarity.
The network we study consists of items listed on
the Amazon web site in August 2003. We concentrate on
the largest component of the network, which has 409 687
items and 2 464 630 edges.

FIG. 2: A visualization of the community structure at maximum
modularity. Note that some major communities
have a large number of "satellite" communities connected only
to them (top, lower left, lower right). Also, some pairs of major
communities have sets of smaller communities that act
as "bridges" between them (e.g., between the lower left and
lower right, near the center).
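
The preprocessing described in this section (discarding edge direction and keeping only the largest component) can be sketched as below. The function name `largest_component`, the input format of (A, B) pairs and the tiny example are assumptions made for illustration. The sketch does not reproduce the actual Amazon data.

```python
from collections import defaultdict, deque

def largest_component(directed_links):
    """Ignore link direction and return (vertices, undirected edges)
    of the largest connected component."""
    adj = defaultdict(set)
    for a, b in directed_links:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)  # a link in either direction counts once
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search over one component.
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        if len(comp) > len(best):
            best = comp
    edges = {tuple(sorted((v, w))) for v in best for w in adj[v]}
    return best, edges

# Hypothetical example: B is listed for A and vice versa, C is listed for B.
links = [("A", "B"), ("B", "A"), ("B", "C"), ("D", "E")]
vertices, edges = largest_component(links)
```

In the example the reciprocal links between A and B collapse into a single undirected edge, and the separate pair D, E is dropped because it is not part of the largest component.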

The dendrogram for this calculation is of course too 
big to draw, but Fig. 1 illustrates the modularity over the
course of the algorithm as vertices are joined into larger 
and larger groups. The maximum value is Q = 0.745, 
which is high as calculations of this type go 0, 13^ 
and indicates strong community structure in the network. 
The maximum occurs when there are 1684 communities 
with a mean size of 243 items each. Fig. 2 gives a visualization
of the community structure, including the major
communities, smaller "satellite" communities connected 
to them, and "bridge" communities that connect two ma- 
jor communities with each other. 

Looking at the largest communities in the network, we
find that they tend to consist of items (books, music) in
similar genres or on similar topics. In Table I we give informal
descriptions of the ten largest communities, which
account for about 87% of the entire network. The remainder
is generally divided into small, densely connected
communities that represent highly specific co-purchasing
habits, e.g., major works of science fiction (162 items),
music by John Cougar Mellencamp (17 items), and books
about (mostly female) spies in the American Civil War
(13 items). It is worth noting that because few real-world
networks have community metadata associated
with them to which we may compare the inferred communities,
this type of manual check of the veracity and
coherence of the algorithm's output is often necessary.

Rank  Size    Description
1     114538  General interest: politics; art/literature; general fiction; human nature; technical books; how things, people, computers, societies work, etc.
2     92276   The arts: videos, books, DVDs about the creative and performing arts
3     78661   Hobbies and interests I: self-help; self-education; popular science fiction, popular fantasy; leisure; etc.
4     54582   Hobbies and interests II: adventure books; video games/comics; some sports; some humor; some classic fiction; some western religious material; etc.
5     9872    classical music and related items
6     1904    children's videos, movies, music and books
7     1493    church/religious music; African-descent cultural books; homoerotic imagery
8     1101    pop horror; mystery/adventure fiction
9     1083    jazz; orchestral music; easy listening
10    947     engineering; practical fashion

TABLE I: The 10 largest communities in the Amazon.com network, which account for 87% of the vertices in the network.

FIG. 3: Cumulative distribution of the sizes of communities
when the network is partitioned at the maximum modularity
found by the algorithm. The distribution appears to follow
a power law form over two decades in the central part of its
range, although it deviates in the tail. As a guide to the
eye, the straight line has slope −1, which corresponds to an
exponent of α = 2 for the raw probability distribution.

One interesting property recently noted in some net- 
works j33, H3 is that when partitioned at the point 
of maximum modularity, the distribution of community
sizes s appears to have a power-law form $P(s) \sim s^{-\alpha}$
for some constant α, at least over some significant range.
The Amazon co-purchasing network also seems to exhibit
this property, as we show in Fig. 3, with an exponent
$\alpha \sim 2$. It is unclear why such a distribution should
arise, but we speculate that it could be a result either of
the sociology of the network (a power-law distribution in
the number of people interested in various topics) or of 
the dynamics of the community structure algorithm. We 
propose this as a direction for further research. 
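
A cumulative distribution of community sizes, of the kind used to examine the power-law form discussed above, can be computed with a few lines of Python. If the raw distribution behaves as a power law with exponent α, integrating it from s upward gives a cumulative distribution that falls off with exponent α − 1, which is why a slope of −1 on the cumulative plot corresponds to α = 2. The function name `cumulative_distribution` and the sample sizes are hypothetical.

```python
from collections import Counter

def cumulative_distribution(sizes):
    """Return sorted pairs (s, fraction of communities with size >= s)."""
    n = len(sizes)
    counts = Counter(sizes)
    result, remaining = [], n
    for s in sorted(counts):
        result.append((s, remaining / n))
        remaining -= counts[s]  # communities of size s are excluded next
    return result

# Made-up community sizes, for illustration only.
sizes = [1, 1, 2, 4]
cdf = cumulative_distribution(sizes)
```

Plotting these pairs on doubly logarithmic axes and comparing against a straight line is the kind of visual check the power-law discussion refers to.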



IV. CONCLUSIONS 

We have described a new algorithm for inferring com- 
munity structure from network topology which works by 
greedily optimizing the modularity. Our algorithm runs 
in time $O(md \log n)$ for a network with n vertices and
m edges where d is the depth of the dendrogram. For
networks that are hierarchical, in the sense that there
are communities at many scales and the dendrogram is
roughly balanced, we have $d \sim \log n$. If the network is
also sparse, $m \sim n$, then the running time is essentially
linear, $O(n \log^2 n)$. This is considerably faster than most
previous general algorithms, and allows us to extend com- 
munity structure analysis to networks that had been con- 
sidered too large to be tractable. We have demonstrated 
our algorithm with an application to a large network of 
co-purchasing data from the online retailer Amazon.com. 
Our algorithm discovers clear communities within this 
network that correspond to specific topics or genres of 
books or music, indicating that the co-purchasing ten- 
dencies of Amazon customers are strongly correlated with 
subject matter. Our algorithm should allow researchers 
to analyze even larger networks with millions of vertices 
and tens of millions of edges using current computing resources,
and we look forward to seeing such applications.